Latest version: 21 January, 2021


TODO:




1 Introduction

This is the entire workflow starting from transcriptome assembly all the way down to the search of differentially expressed genes (DEG), the analysis of gene ontology (GO) terms, and the comparison with datasets in M. cinxia and D. plexippus. All R functions starting with “my_” are home-maid and cannot be found in packages (though often they are just wrappers around functions from DESeq2). All functions can be found in the file functions.R.

2 Transcriptome assembly:

2.1 Preparing reads for assembly:

12 butterfly transcriptomes were sequenced in two batches (2 x 100 bp) Number of reads:

Sample total number of reads
A 25707028
B 29316918
C 26724523
D 30134453
E 32161097
F 33247006
G 29139146
H 25737323
I 29401271
J 34389745
K 32143200
L 22319576

First, overlapping reads were merged using leehom:

~/bin/leeHom/src/leeHom -fq1 A_R1.fq -fq2 A_R2.fq -fqo leehomA --verbose --log leehomlogA &
~/bin/leeHom/src/leeHom -fq1 B_R1.fq -fq2 B_R2.fq -fqo leehomB --verbose --log leehomlogB &
~/bin/leeHom/src/leeHom -fq1 C_R1.fq -fq2 C_R2.fq -fqo leehomC --verbose --log leehomlogC &
~/bin/leeHom/src/leeHom -fq1 D_R1.fq -fq2 D_R2.fq -fqo leehomD --verbose --log leehomlogD &
~/bin/leeHom/src/leeHom -fq1 E_R1.fq -fq2 E_R2.fq -fqo leehomE --verbose --log leehomlogE &
~/bin/leeHom/src/leeHom -fq1 F_R1.fq -fq2 F_R2.fq -fqo leehomF --verbose --log leehomlogF &
~/bin/leeHom/src/leeHom -fq1 G_R1.fq -fq2 G_R2.fq -fqo leehomG --verbose --log leehomlogG &
~/bin/leeHom/src/leeHom -fq1 H_R1.fq -fq2 H_R2.fq -fqo leehomH --verbose --log leehomlogH &
~/bin/leeHom/src/leeHom -fq1 I_R1.fq -fq2 I_R2.fq -fqo leehomI --verbose --log leehomlogI &
~/bin/leeHom/src/leeHom -fq1 J_R1.fq -fq2 J_R2.fq -fqo leehomJ --verbose --log leehomlogJ &
~/bin/leeHom/src/leeHom -fq1 K_R1.fq -fq2 K_R2.fq -fqo leehomK --verbose --log leehomlogK &
~/bin/leeHom/src/leeHom -fq1 L_R1.fq -fq2 L_R2.fq -fqo leehomL --verbose --log leehomlogL &

Then, all merged reads were concatenated, as well as all non-merged left and right reads:

cat leehomA.fq leehomB.fq leehomC.fq leehomD.fq leehomE.fq leehomF.fq leehomG.fq leehomH.fq leehomI.fq leehomJ.fq leehomK.fq leehomL.fq > mergedreads.fq
cat leehomA_r1.fq leehomB_r1.fq leehomC_r1.fq leehomD_r1.fq leehomE_r1.fq leehomF_r1.fq leehomG_r1.fq leehomH_r1.fq leehomI_r1.fq leehomJ_r1.fq leehomK_r1.fq leehomL_r1.fq > mergedreads_r1.fq
cat leehomA_r2.fq leehomB_r2.fq leehomC_r2.fq leehomD_r2.fq leehomE_r2.fq leehomF_r2.fq leehomG_r2.fq leehomH_r2.fq leehomI_r2.fq leehomJ_r2.fq leehomK_r2.fq leehomL_r2.fq > mergedreads_r2.fq

Next, Trimmomatic was used to trim off low-quality base calls and adapter sequences. The first command is for the reads merged by leehom, the second is for the read pairs that leehom kept apart. The file compound_PE.fa contains the adapters used for sequencing, and their reverse complements.

java -jar ~/jar/trimmomatic-0.33.jar SE -threads 30 -phred33 -trimlog logmerged mergedreads.fq M_singleOUT.fq ILLUMINACLIP:/home/rik/jar/Trimmomatic-0.33/adapters/compound_PE.fa:3:27:7:2 LEADING:10 TRAILING:10 SLIDINGWINDOW:3:10 MINLEN:70 &
java -jar ~/jar/trimmomatic-0.33.jar PE -threads 30 -phred33 -trimlog logpaired mergedreads_r1.fq mergedreads_r2.fq L_pairedOUT_p.fq L_pairedOUT_s.fq R_pairedOUT_p.fq R_pairedOUT_s.fq ILLUMINACLIP:/home/rik/jar/Trimmomatic-0.33/adapters/compound_PE.fa:3:27:7:2 LEADING:10 TRAILING:10 SLIDINGWINDOW:3:10 MINLEN:70 &

2.2 De Novo assembly

Finally, we used Trinity to assemble all this into one reference transcriptome:

~/trinityrnaseq-v2.11.0/Trinity --NO_SEQTK --include_supertranscripts --full_cleanup --min_per_id_same_path 96 --output trinity_96 --no_normalize_reads --seqType fq --min_kmer_cov 1 --left /floppy/rik/pieris/phase3/left_and_single.fq.gz --right /floppy/rik/pieris/phase3/R_pairedOUT_p.fq.gz --CPU 60 --max_memory 210G 1>run.log96 2>error96 &

3 Exploring the crude assembly

3.1 General assembly stats:

TODO: run assembly stats when I have access to HAL again.

3.2 Content: BLASTp in databases of a variety of organisms

Based on our preliminary analyses (ribosomal protein survey) there were traces of a number of contaminants. We downloaded proteomes of potential contaminants, and ran BLASTx. Those BLASTx can inform us about the nature of contaminants, and can later also help to filter out contigs and to build a final reference transcriptome. The organisms in which we executed these searches can be found in the table under the next codeblock.

BLASTx in proteomes of lepidopterans, arthropods and potential contaminants:

for database in $(ls /floppy/rik/pieris/REF2020/blastp_databases/ | grep ".fa$" ); 
do blastx -query trinity_96.Trinity.fasta -db /floppy/rik/pieris/REF2020/blastp_databases/$database -num_threads 20 -max_target_seqs 5 -outfmt 6 > OUT/blastx.$database; 
done

BLASTx in Pieris rapae proteome:

blastx -query trinity_96.Trinity.fasta -db ../rapae_loci/pieris_rapae_predicted_prot.fa -num_threads 20 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore slen qlen" -evalue 1e-3 > blast.outfmt6

The output of these all these BLASTx searches was read into R:

B <- read.table("~/pieris/REF2020/denovo/blast.outfmt6",col.names=c("qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore","slen","qlen"))
### Next all potential contaminants plus a series of arthropods:

# Lepidopterans: 
lepid  <- c("BOMMO","DAPLE","OPBRU","PAPMA","PAPXU","HELVI","HELAM","CHISP","LEPSI")
# Other arthropods:
arthr  <- c("DROME","TRICA","AEDAE","ANOGA","ZOONE","ACYPI","BLAGE","RHOPR","ORCCI","FOLCA","CRYSE","DAPPU","DAPMA")
# Other animals:
animal <- c("CAEEL","MOUSE","HUMAN","CHICK","SALSA")      
# Plants:
plant  <- c("WHEAT","BRAOL","ARATH","SOYBN")
# Eubacteria:
eubact <- c("PSEAE","ECOLI","BACSU")
# Other eukaryotes such as fungi, amoebozoa, apicomplexa, excavata
eukar  <- c("YEAST","DEBHA","NOSB1","ENCCU","ASPNG","AERPU","RHIO9","MUCC1","TOXGV","PLAF7","GRENI","ENTIN","ENTHI","ACACA","NAEGR")

X <- setNames(data.frame(matrix(ncol = 14, nrow = 0)), c("species","category","qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore"))
for(category in c("lepid","arthr","animal","plant","eubact","eukar"))
{
    for (species in get(category))
    {
        x   <- cbind.data.frame("species"=species,"category"=category,read.table(paste("~/pieris/REF2020/denovo/OUT/blastx",species,"fa",sep="."),col.names=c("qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore")))
        # Only keep the best bit score hits:
        x   <- x[x$evalue<0.01,]
        x   <- x[order(x$bitscore,decreasing=T),]
        x   <- x[!duplicated(x$qseqid),]
        X   <- rbind(X,x)
    }   
}

# First, take a backup
X_bkp <- X                   

# Add Pieris Rapae:
prap <- cbind.data.frame(species="PIRAP",category="lepid",B[,1:12])
X    <- rbind(prap,X)

# Cut off percentage of identity at 60%
X    <- X[X$pident>60,]      #  Goes from 3.98m to 957k lines

# Sort for bitscore:
# Given the order in which they were read in, for equal bitscores, 
# priority should be given to lepidoptera > arthropods > animals etc
X    <- X[order(X$bitscore,decreasing=T),]
XX   <- X[!duplicated(X$qseqid),]       # 95726 lines, which means ~96k Trinity contigs got a BLASTx hit

A quick look at what this looks like:

head(XX)
##        species category                   qseqid       sseqid pident length
## 228964   PIRAP    lepid TRINITY_DN17067_c1_g1_i1 LOC110996083 97.192   7871
## 228946   PIRAP    lepid TRINITY_DN17067_c1_g1_i2 LOC110996083 96.970   7889
## 980202   PIRAP    lepid  TRINITY_DN1291_c0_g1_i1 LOC110994581 98.406   6336
## 977944   PIRAP    lepid TRINITY_DN1291_c0_g1_i24 LOC110994581 98.264   6336
## 978697   PIRAP    lepid  TRINITY_DN1291_c0_g1_i4 LOC110994581 98.264   6336
## 979450   PIRAP    lepid TRINITY_DN1291_c0_g1_i18 LOC110994581 98.264   6336
##        mismatch gapopen qstart  qend sstart send evalue bitscore
## 228964      188       1    270 23783      1 7871      0    14978
## 228946      188       2    270 23837      1 7871      0    14967
## 980202       91       2  19008     1   2269 8594      0    12876
## 977944       91       3  18981     1   2269 8594      0    12850
## 978697       91       3  18981     1   2269 8594      0    12849
## 979450       91       3  18981     1   2269 8594      0    12848
table(XX$category)
## 
## animal  arthr eubact  eukar  lepid  plant 
##    116   1652      7   8188  75366  10397

All tested organisms, plus the number of best BLAST hits, can be found in the following table:

Some observations:

  • Most hits are in lepidopterans, and within lepidoptera Pieris rapae. Totally expected
  • Not many aspecific arthropod hits, and little phylogenetic signal within other arthropod hits (e.g. many hemipteran and dictyopteran hits)
  • Virtually no other animal contaminations and no hits to eubacteria (makes sense with a poly-A enriched set to start with)
  • Butterflies live on and off plants, so plant contaminations, totally expected.
  • Clearly some fungi. We know there is Nosema (confirmed), but probably other ascomycete (Saccharomyces, Debaryomyces, Aureobasidium, Aspergillius) and zygomycete (Mucor, Rhizopus) molds as well.

4 Selecting reference contigs

We will use the predicted Pieris rapae transcriptome as a guide to produce reference contigs for Pieris brassicae. To this end, we first select contigs that have best hits in lepidopterans. Next, these contigs will be aligned against the rapae transcriptome.

Contigs that got a hit in other arthropods better than lepidopterans, were typically rubbish hits of very small length. Therefore, we define the B dataframe, that has only contigs with best hits in lepidopterans.

lepidopteran_contigs <- XX$qseqid[XX$category=="lepid"]  # 75366 contigs
B                    <- B[B$qseqid %in% lepidopteran_contigs,]

Now we want to cut-off the Pieris rapae BLASTx results at a sensible “% identity” cut-off. First: check what is the distribution of % identity of best hitting per subject:

B                    <- B[order(B$bitscore,decreasing=T),]
# Save this version of B. Later we will adjust B, but if ever we want to replicate this figure...
B_bkp                <- B
B_sub                <- B[!duplicated(B$sseqid),]
hist(B_sub$pident,breaks=200,xlim=c(0,100),col="black",freq=F,xlab="percentage identity",main="percentage identity for blastx hits")
hist(B_bkp$pident,breaks=200,xlim=c(0,100),col="red",add=T,freq=F)
legend("topleft",col=c("red","black"),legend=c("all blast hits", "highest bitscore per subject"),pch=15)

Based on these graphs, a fair cut-off seems to be pident > 85%

B <- B[B$pident>85,]
B <- B[order(B$bitscore,decreasing=T),]
B <- B[order(B$sseqid),]
length(unique(B$sseqid)) # 11419 down from 12296
## [1] 10741

Now in a next step, for each query (contig), I only keep those hits on the rapae cds in which they got the best bit score:

B$combo       <- paste(B$qseqid,B$sseqid,sep="_")
B             <- B[order(B$combo),]
B$combosum    <- unlist(by(B$bitscore,B$combo,function(x){rep(sum(x),length(x))}))
B$combolength <- unlist(by(B$length,B$combo,function(x){rep(sum(x),length(x))}))
B            <- B[order(B$combosum,decreasing=T),]
uniquecombos <- B$combo[!duplicated(B$qseqid)]
B            <- B[B$combo %in% uniquecombos,]
length(unique(B$sseqid))  
## [1] 10741
# 10741 down from 11419
# This means that 678 rapae cds only had second best hits 

Now execute the scaffolds function (can be found in functions.R). The algorithm is as follows:

For each rapae contig:

The output of this function is a list with BLASTx output of selected brassicae contigs per rapae contig. Next, these can be used to make explorative plots, and also to generate scaffolds.

# Careful here, the scaffold function as it is scripted here, runs on 20 cores!
foobar<-parallel::mclapply(unique(B$sseqid),my_scaffold_fun,mc.cores=20L)
SCAFFOLDS <- do.call("rbind",foobar)
write.table(SCAFFOLDS,file="x.txt")
doubles          <- names(which(table(SCAFFOLDS$sseqid)>1))
SCAFFOLDS_double <- SCAFFOLDS[SCAFFOLDS$sseqid%in%doubles,]

And use the plotblastmap function (functions.R) to show what a random sample of these scaffolds look like:

set.seed(11)
par(mfrow=c(2,4));for(k in sample(doubles,16,replace=F)){plotblastmap(k,SCAFFOLDS)}

This is a good sample of what the scaffolds really look like. The x-axis represents the length of the rapae contig. Each line is a brassicae contig. The grey lines represent the full length of those contigs, and the red zone represents a BLAST match. The % identity is printed right beneath the match. For example, scaffold LOC111000361 consists of two brassicae contigs with little overlap. Many scaffolds consist of one brassicae contig with a good overall match, or several stretches of matches. The numbers of contigs within scaffolds are summarized as follows:


number of contigs: 1 2 3 4 5 5+
number of scaffolds: 9691 791 166 45 20 28

Now we need an output from this R session, that will as an input for a bash function that reads in brassicae contigs and scaffolds them. So, first, make sure that contigs are ordered per rapae CDS, according to their first position within the rapae CDS.

S <- SCAFFOLDS[order(SCAFFOLDS$sstart),]
S <- SCAFFOLDS[order(SCAFFOLDS$sseqid),]

Next, remove duplicated hits, and add a column to determine the strand. Retain only relevant columns.

S <- S[!duplicated(S$qseqid),]
S$strand <- c("-","+")[c(S$qstart < S$qend)+1]
S <- cbind(sseqid=as.character(S$sseqid),qseqid=as.character(S$qseqid), strand=as.character(S$strand))

Turn contig names into fasta identifiers so they are easier to pick up with a bash script. Put a space behind trinity identifiers. Replace every non-first mention of a rapae cds in a stretch of NNN

S[,1] <- paste(">",S[,1],sep="")
S[,1][duplicated(S[,1])] <- "NNNNNNNNNN"

The output looks something like this (12294 rows in total)

##       sseqid          qseqid                      strand
##  [1,] ">LOC110991522" "TRINITY_DN41485_c1_g1_i2"  "+"   
##  [2,] "NNNNNNNNNN"    "TRINITY_DN275329_c0_g1_i1" "-"   
##  [3,] ">LOC110991523" "TRINITY_DN32422_c0_g1_i1"  "-"   
##  [4,] ">LOC110991526" "TRINITY_DN79934_c0_g1_i1"  "-"   
##  [5,] ">LOC110991527" "TRINITY_DN15710_c0_g1_i1"  "-"   
##  [6,] "NNNNNNNNNN"    "TRINITY_DN8909_c0_g1_i1"   "-"   
##  [7,] "NNNNNNNNNN"    "TRINITY_DN9072_c0_g1_i1"   "-"   
##  [8,] ">LOC110991528" "TRINITY_DN150270_c0_g1_i1" "-"   
##  [9,] ">LOC110991529" "TRINITY_DN122_c0_g1_i1"    "-"   
## [10,] ">LOC110991530" "TRINITY_DN174228_c0_g1_i1" "-"   
## [11,] ">LOC110991531" "TRINITY_DN868_c2_g1_i1"    "-"   
## [12,] ">LOC110991532" "TRINITY_DN1977_c0_g1_i1"   "-"   
## [13,] ">LOC110991533" "TRINITY_DN403_c0_g1_i1"    "+"

And finally, write this out as a file that will serve as an input to a shell script:

write.table(x=S,file="scaffolds.txt",row.names=F,col.names=F,quote=F)

Quit R, and run the following function called fastamaker.sh:

#!/bin/bash
echo $1
if [ "$3" != "-" ]; then
    grep -A1 "$2\s" trinity_96.Trinity.fasta --no-group-separator | grep -v 'TRINITY'
else
    grep -A1 "$2\s" trinity_96.Trinity.fasta --no-group-separator | grep -v 'TRINITY' | rev | tr 'ACTG' 'TGAC'
fi

Go to the directory that contains the scaffolds.txt file and execute:

touch out.txt
while read line; 
do echo "$line" | xargs ./testfun.sh >> out.txt; 
done < scaffolds.txt

# Make it a singleline fasta (small function that turns a block fasta into singleline)
cat out.txt | singleline > rapae_based_brassicae_scaffolds.fasta

Now we have a reference containing a Brassicae scaffold for 10741 Rapae orthologues. Note that these scaffolds are potentially rough, because they were just stuck together with a small string of poly-N. However, for counting mappings this really does not matter much. What is important, is that for each locus, as much sequence information as possible, is present.

5 Pseudo-alignment with Kallisto Quant

First, build a kallisto index: {” bash, eval = F} ~/kallisto_linux-v0.46.1/kallisto index -i rapae_based_brassicae_scaffolds.fasta.index rapae_based_brassicae_scaffolds.fasta

Next, map the raw data with bias correction. We have tried with and without. It makes virtually no difference:

~/kallisto_linux-v0.46.1/kallisto quant -i ~/pieris/REF2020/denovo/rapae_based_brassicae_scaffolds.fasta.index -o kallisto_rapae_based_brassicae_bias.A --bias -b 100 -t 6 A_L3_R1.fq.gz A_L3_R2.fq.gz A_L6_R1.fq.gz A_L6_R2.fq.gz &

# Next, run for B, C, D etc

6 Annotation with Trinotate

For annotation, we the standard Trinotate pipeline was used. Se used Pieris rapae gene predictions, since they are generally more complete than our Pieris brassicae contigs.

First, a folder is made in which we will collect all information. Next, in this folder, we put the pfam and uniprot databases, which we prepare to be used in searches.

cd ~/Trinotate-v3.2.1
hmmpress Pfam-A.hmm
makeblastdb -in uniprot_sprot.pep -parse_seqids -dbtype prot

Since Trinotate requires data to be in a very specific format, we could not use the published protein predictions. Therefore, we re-ran CDS predictions using Transdecoder:

~/TransDecoder-TransDecoder-v5.5.0/TransDecoder.LongOrfs -t pieris_rapae_predicted_cds.fa
~/TransDecoder-TransDecoder-v5.5.0/TransDecoder.Predict  -t pieris_rapae_predicted_cds.fa --no_refine_starts

Now we can start running BLASTp, BLASTx, and hmmscan to generate the data that will populate our Trinotate database:

# BLASTp the protein predictions:
blastp -query pieris_rapae_predicted_cds.fa.transdecoder.pep -db ~/Trinotate-v3.2.1/uniprot_sprot.pep -num_threads 30 -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastp.outfmt6 &
  
# Scan the protein predictions for protein domains:
hmmscan --cpu 30 --domtblout pierisPFAM.out ~/Trinotate-v3.2.1/Pfam-A.hmm pieris_rapae_predicted_cds.fa.transdecoder.pep > pfam.log &

# BLASTx the nucleotide sequences in known proteins:
blastx -query trinity_96.Trinity.fasta -db ../rapae_loci/pieris_rapae_predicted_prot.fa -num_threads 20 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore slen qlen" -evalue 1e-3 > blast.outfmt6

Generate the –gene_trans_map file that is required to initiate a Trinotate database, and initiate the database. Next, populate the database with all data that we assembled. Finally, we generate a tabular output which we can use for further analyses:

grep '>' pieris_rapae_predicted_cds.fa | cut -d'>' -f2 | cut -d' ' -f1 > locusnames
rapae_loci$ paste locusnames locusnames > trans_gene_map.txt

# Initiate database:
~/Trinotate-v3.2.1/Trinotate Trinotate.sqlite init --gene_trans_map trans_gene_map.txt --transcript_fasta pieris_rapae_predicted_cds.fa --transdecoder_pep pieris_rapae_predicted_cds.fa.transdecoder.pep

# Populate database:
~/Trinotate-v3.2.1/Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6 
~/Trinotate-v3.2.1/Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6 
~/Trinotate-v3.2.1/Trinotate Trinotate.sqlite LOAD_pfam pierisPFAM.out 

# Generate report:
~/Trinotate-v3.2.1/Trinotate Trinotate.sqlite report -E 1e-3 > trinotate_annotation_report.xls

7 Differential gene expression

7.1 Reading in data and setting up environment.

Open an R session, and read in all data and metadata. Load all necessary packages and functions. Next, we define subsets of the data (only August, only September, only butterflies that showed dispersal behaviour, only butterflies that did not show dispersal behaviour). Notice that, where subsets of our data are defined, the fourth sample is omitted. This is sample D, which shows severe gene deviations in patterns of gene expression, presumably because of an infection with a microsporidian parasite.

setwd("~/pieris/")
library("tximport")
library("DESeq2")
library("pheatmap")
library("vsn")
library("plotly")
library("lme4")
library("GO.db")
library("dplyr")
source("functions.R")
# DATA:
trinot <- read.table("~/pieris/rapae_loci/trinotate_annotation_report.xls",header=T,sep="\t",comment.char="",stringsAsFactors=F,quote="")

sample              <- factor(LETTERS[1:12])
originalname        <- factor(c("s17","s18","s33","s46","s56","s61","sD163","sD178","sD185","sD239","sDL47","sDL61")) 
clutch              <- factor(c("marion1","marion1","marion1","marion1","marion1","marion1","Delphine3","Delphine6","Delphine6","Delphine5","Delphine24","Delphine24"))
month               <- factor(c(rep("aug",6),rep("sep",6)))
sex                 <- factor(c("f","f","m","m","f","f","m","m","f","f","f","m"))
dispersotype        <- factor(c("d","r","r","d","d","r","d","r","d","r","r","d"))
path_brassicae      <- c("./mappings/kallisto_rapae_based_brassicae_bias.A/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.B/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.C/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.D/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.E/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.F/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.G/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.H/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.I/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.J/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.K/abundance.h5","./mappings/kallisto_rapae_based_brassicae_bias.L/abundance.h5") 
path_rapae          <- c("./mappings/kallisto_rapae_bias.A/abundance.h5","./mappings/kallisto_rapae_bias.B/abundance.h5","./mappings/kallisto_rapae_bias.C/abundance.h5","./mappings/kallisto_rapae_bias.D/abundance.h5","./mappings/kallisto_rapae_bias.E/abundance.h5","./mappings/kallisto_rapae_bias.F/abundance.h5","./mappings/kallisto_rapae_bias.G/abundance.h5","./mappings/kallisto_rapae_bias.H/abundance.h5","./mappings/kallisto_rapae_bias.I/abundance.h5","./mappings/kallisto_rapae_bias.J/abundance.h5","./mappings/kallisto_rapae_bias.K/abundance.h5","./mappings/kallisto_rapae_bias.L/abundance.h5") 
META                <- cbind.data.frame(sample,originalname,clutch,month,sex,dispersotype,path_brassicae,path_rapae)
META$path_brassicae <- as.character(META$path_brassicae)
META$path_rapae     <- as.character(META$path_rapae)

# Here, we remove sample D:
META_all <- META[-4,]
META_aug <- META_all[META_all$month=="aug",]
META_sep <- META_all[META_all$month=="sep",]
META_res <- META_all[META_all$dispersotype=="r",]
META_mig <- META_all[META_all$dispersotype=="d",]

# Since for PCA we will be looking at the dataset including sample D, we also read it in:
txi_withD   <- tximport(files=META$path_brassicae,type="kallisto",txOut=TRUE,varReduce=TRUE,readLength=100)
txi_all   <- tximport(files=META_all$path_brassicae,type="kallisto",txOut=TRUE,varReduce=TRUE,readLength=100)
txi_aug   <- tximport(files=META_aug$path_brassicae,type="kallisto",txOut=TRUE,varReduce=TRUE,readLength=100)
txi_sep   <- tximport(files=META_sep$path_brassicae,type="kallisto",txOut=TRUE,varReduce=TRUE,readLength=100)
txi_res   <- tximport(files=META_res$path_brassicae,type="kallisto",txOut=TRUE,varReduce=TRUE,readLength=100)
txi_mig   <- tximport(files=META_mig$path_brassicae,type="kallisto",txOut=TRUE,varReduce=TRUE,readLength=100)

The metadata looks like this:

Next, we use the function my_coveragefilter, a small function that selects within a txi object only those features above a coverage cutoff in a certain number of samples. In this case: at least 3 samples with at least coverage = 2.

txi_withD <- my_coveragefilter(txi_withD)
txi_all   <- my_coveragefilter(txi_all)
txi_aug   <- my_coveragefilter(txi_aug) 
txi_sep   <- my_coveragefilter(txi_sep)
txi_res   <- my_coveragefilter(txi_res)
txi_mig   <- my_coveragefilter(txi_mig)

The histograms of numbers of pseudomappings per scaffold all look more or less the same:

Next, define experimental designs, and import txi objects with standard parameters (meaning Wald type test):

dds_withD      <- DESeq(DESeqDataSetFromTximport(txi_withD,colData = META, design = ~ dispersotype + month))
dds_full       <- DESeq(DESeqDataSetFromTximport(txi_all,colData = META_all, design = ~ dispersotype + month))
dds_disp       <- DESeq(DESeqDataSetFromTximport(txi_all,colData = META_all, design = ~ dispersotype))
dds_month      <- DESeq(DESeqDataSetFromTximport(txi_all,colData = META_all, design = ~ month))
dds_sex        <- DESeq(DESeqDataSetFromTximport(txi_all,colData = META_all, design = ~ sex))
dds_aug        <- DESeq(DESeqDataSetFromTximport(txi_aug,colData = META_aug, design = ~ dispersotype))
dds_sep        <- DESeq(DESeqDataSetFromTximport(txi_sep,colData = META_sep, design = ~ dispersotype))
dds_res        <- DESeq(DESeqDataSetFromTximport(txi_res,colData = META_res, design = ~ month))
dds_mig        <- DESeq(DESeqDataSetFromTximport(txi_mig,colData = META_mig, design = ~ month))

7.2 Run inferences for DEG:

We defined some wrapper functions. my_deseqfun: executes differential gene expression analysis, and outputs a list with statistical inferences per gene, with information about metadata and annotation.

dds_withD      <- my_deseqfun(dds_withD,contrast=c("dispersotype", "d", "r"), meta=META,trinot=trinot)
dds_full_disp  <- my_deseqfun(dds_full,contrast=c("dispersotype", "d", "r"), meta=META_all,trinot=trinot)
dds_full_month <- my_deseqfun(dds_full,contrast=c("month", "aug", "sep"), meta=META_all,trinot=trinot)
dds_disp       <- my_deseqfun(dds_disp,contrast=c("dispersotype", "d", "r"), meta=META_all,trinot=trinot)
dds_month      <- my_deseqfun(dds_month,contrast=c("month", "aug", "sep"), meta=META_all,trinot=trinot)
dds_sex        <- my_deseqfun(dds_sex,contrast=c("sex", "f", "m"), meta=META_all,trinot=trinot)
dds_aug        <- my_deseqfun(dds_aug,contrast=c("dispersotype", "d", "r"), meta=META_aug,trinot=trinot)
dds_sep        <- my_deseqfun(dds_sep,contrast=c("dispersotype", "d", "r"), meta=META_sep,trinot=trinot)
dds_res        <- my_deseqfun(dds_res,contrast=c("month", "aug", "sep"), meta=META_res,trinot=trinot)
dds_mig        <- my_deseqfun(dds_mig,contrast=c("month", "aug", "sep"), meta=META_mig,trinot=trinot)

7.3 Principal component analysis of gene expression.

In order to see how the different samples relate to each other, we execute a principle component analysis (PCA) on the transpose of the normalized pseudo-alignment count matrices. We do this both for the count matrix of all 12 samples (including sample D, which is excluded from further analyses) and for 11 samples (D dropped). The PCA does not need to be scaled, because after transformation, the relationship between mean counts and their variance is extremely weak.

### We first need two datasets:
# the 12 sample one to show that sample D is off
# the 11 sample one.

X_11    <- t(dds_full_disp$dat.u)       # Doesn't really matter which one you choose (could equally well be e.g. dds_sex), as long as it contains 11 samples
X_12    <- t(dds_withD$dat.u)

pca_11  <- prcomp(X_11,scale=F)         # You don't have to scale, given there is an extremely weak relationship between variance and mean after the appropriate transfo's have been executed (upstream)
pca_12  <- prcomp(X_12,scale=F)

pcax_11 <- as.data.frame(pca_11$x)
colnames(pcax_11)<-c("pc1","pc2","pc3","pc4","pc5","pc6","pc7","pc8","pc9","pc10","pc11")
pcax_12 <- as.data.frame(pca_12$x)
colnames(pcax_12)<-c("pc1","pc2","pc3","pc4","pc5","pc6","pc7","pc8","pc9","pc10","pc11","pc11")

summary(pca_11)
## Importance of components:
##                            PC1     PC2      PC3      PC4      PC5     PC6
## Standard deviation     37.9319 23.9656 17.76087 16.54387 12.34070 11.5955
## Proportion of Variance  0.4451  0.1777  0.09759  0.08468  0.04712  0.0416
## Cumulative Proportion   0.4451  0.6228  0.72041  0.80509  0.85220  0.8938
##                             PC7      PC8     PC9    PC10      PC11
## Standard deviation     10.75452 10.23632 8.73928 6.81611 5.767e-14
## Proportion of Variance  0.03578  0.03242 0.02363 0.01437 0.000e+00
## Cumulative Proportion   0.92958  0.96200 0.98563 1.00000 1.000e+00
summary(pca_12)
## Importance of components:
##                            PC1     PC2     PC3      PC4     PC5      PC6
## Standard deviation     40.3003 34.9431 24.4636 17.05859 16.1984 12.49519
## Proportion of Variance  0.3466  0.2606  0.1277  0.06211  0.0560  0.03332
## Cumulative Proportion   0.3466  0.6072  0.7350  0.79707  0.8531  0.88640
##                             PC7      PC8      PC9    PC10    PC11     PC12
## Standard deviation     12.19756 11.08680 10.86719 9.34664 7.42491 6.46e-14
## Proportion of Variance  0.03175  0.02623  0.02521 0.01865 0.01177 0.00e+00
## Cumulative Proportion   0.91815  0.94438  0.96959 0.98823 1.00000 1.00e+00

7.3.1 PCA including sample D

First, let’s have a look at the PCA including sample D:

par(mfrow=c(2,2))
screeplot(pca_12, main="Scree plot")  
colcode<-c("red","magenta","green","cyan")[factor(paste(META$dispersotype,META$month,sep=""))]

plot(pcax_12[,"pc1"],pcax_12[,"pc2"],col=colcode,pch=as.character(META$sample),cex=1.5,xlim=c(-120,120),ylim=c(-120,120),xlab="PC1",ylab="PC2")
legend("bottomleft",col=c("red","magenta","green","cyan"),legend=c("disperser August", "disperser September", "resident August", "resident September"),pch=16)
plot(pcax_12[,"pc2"],pcax_12[,"pc3"],col=colcode,pch=as.character(META$sample),cex=1.5,xlim=c(-100,100),ylim=c(-100,100))
legend("bottomleft",col=c("red","magenta","green","cyan"),legend=c("disperser August", "disperser September", "resident August", "resident September"),pch=16)
plot(pcax_12[,"pc3"],pcax_12[,"pc4"],col=colcode,pch=as.character(META$sample),cex=1.5,xlim=c(-50,50),ylim=c(-50,50))
legend("bottomleft",col=c("red","magenta","green","cyan"),legend=c("disperser August", "disperser September", "resident August", "resident September"),pch=16)

From these graphs it can clearly be derived that sample D deviates from all other samples. Given the fact that this sample was the only one with a measurable microsporidian contamination, we decided to leave it out of the remainder of the analyses.

7.3.2 PCA excluding sample D:

First, we demonstrate that the first 3 axes grossly correspond to “dispersotype”, “month” and “sex”:

par(mfrow=c(3,2))
screeplot(pca_11,main="Scree plot")
boxplot(pcax_11$pc1 ~ META_all$dispersotype,ylab="pc1",xlab="dispersotype",main="B")
boxplot(pcax_11$pc2 ~ META_all$month,ylab="pc2",xlab="month",main="C")
boxplot(pcax_11$pc3 ~ META_all$sex,ylab="pc3",xlab="sex",main="D")

boxplot(pcax$pc1 ~ META_all$month,ylab="pc1",xlab="month")
        
boxplot(pcax$pc2 ~ META_all$dispersotype,ylab="pc2",xlab="dispersotype")

Next, some plots to show how samples group in the planes of the first couple of PC’s:

colcode<-c("red","magenta","green","cyan")[factor(paste(META_all$dispersotype,META_all$month,sep=""))]
plot(pcax_11[,"pc1"],pcax_11[,"pc2"],col=colcode,pch=as.character(META_all$sample),cex=2,xlim=c(-90,90),ylim=c(-90,90),xlab="PC1",ylab="PC2",main="PCA on transformed counts for 9684 genes in 11 samples")
legend("topright",col=c("red","magenta","green","cyan"),legend=c("disperser August", "disperser September", "resident August", "resident September"),pch=16)
plot(pcax_11[,"pc2"],pcax_11[,"pc3"],col=colcode,pch=as.character(META_all$sample),cex=2,xlim=c(-70,70),ylim=c(-70,70))
legend("topright",col=c("red","magenta","green","cyan"),legend=c("disperser August", "disperser September", "resident August", "resident September"),pch=16)
plot(pcax_11[,"pc3"],pcax_11[,"pc4"],col=colcode,pch=as.character(META_all$sample),cex=2,xlim=c(-40,40),ylim=c(-40,40))
legend("topright",col=c("red","magenta","green","cyan"),legend=c("disperser August", "disperser September", "resident August", "resident September"),pch=16)
plot(pcax_11[,"pc4"],pcax_11[,"pc5"],col=colcode,pch=as.character(META_all$sample),cex=2,xlim=c(-40,40),ylim=c(-40,40))
legend("topright",col=c("red","magenta","green","cyan"),legend=c("disperser August", "disperser September", "resident August", "resident September"),pch=16)

For completeness: the entire pairs plot. Color code is the same as the plots above.

colcode<-c("red","magenta","green","cyan")[factor(paste(META_all$dispersotype,META_all$month,sep=""))]
pairs(pcax_11,col=colcode,pch=20)

And finally a 3D plot on the first 3 principal components. Dispersers in red, residents in blue.

library("plotly")
library("dplyr")
p <- plot_ly(pcax_11, x = ~pc1, y = ~pc2, z = ~pc3, color = ~META_all$dispersotype, colors = c('#BF382A', '#0C4B8E')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'pc1'),yaxis = list(title = 'pc2'),zaxis = list(title = 'pc3')))

p

7.4 Results for differential gene expression:

Now we have looked into the general patterns in the data using PCA, and shown why sample D was excluded from further analysis, we can continue with looking at our inferences on differential gene expression.

First, let’s look at the distribution of Cook’s distances for all samples. Large Cook’s distances indicate that single datapoints have a disproportionally large influence on the regression.

par(mfrow=c(3,3))
boxplot(log(assays(dds_full_disp$dds)[["cooks"]]),las=2,main="log Cook's distance full model dispersotype",range=0,names=META_all$sample)
boxplot(log(assays(dds_full_month$dds)[["cooks"]]),las=2,main="log Cook's distance full model month",range=0,names=META_all$sample)
boxplot(log(assays(dds_disp$dds)[["cooks"]]),las=2,main="log Cook's distance dispersotype",range=0,names=META_all$sample)
boxplot(log(assays(dds_month$dds)[["cooks"]]),las=2,main="log Cook's distance month",range=0,names=META_all$sample)
boxplot(log(assays(dds_aug$dds)[["cooks"]]),las=2,main="log Cook's distance August only",range=0,names=META_aug$sample)
boxplot(log(assays(dds_sep$dds)[["cooks"]]),las=2,main="log Cook's distance September only",range=0,names=META_sep$sample)
boxplot(log(assays(dds_mig$dds)[["cooks"]]),las=2,main="log Cook's distance dispersers only",range=0,names=META_mig$sample)
boxplot(log(assays(dds_res$dds)[["cooks"]]),las=2,main="log Cook's distance residents only",range=0,names=META_res$sample)
boxplot(log(assays(dds_sex$dds)[["cooks"]]),las=2,main="log Cook's distance sex",range=0,names=META_all$sample)

We can conclude that generally, there is no single sample that has a disproportionally large influence on regressions for all genes. In the “sex” model, males seem to have a slightly larger average Cook’s distance.

Next, we produce MA plots. The adjusted p-value cut-off we chose is 0.05 (same cutoff as is presented in the paper, and we will use later on).

# MA plots:
par(mfrow=c(3,3))
DESeq2::plotMA(dds_full_disp$res, alpha = 0.05, ylim=c(-3,3),main="Full model, dispersotype")
DESeq2::plotMA(dds_full_month$res, alpha = 0.05, ylim=c(-3,3),main="Full model, month")
DESeq2::plotMA(dds_disp$res, alpha = 0.05, ylim=c(-3,3),main="Dispersotype")
DESeq2::plotMA(dds_month$res, alpha = 0.05, ylim=c(-3,3),main="Month")
DESeq2::plotMA(dds_aug$res, alpha = 0.05, ylim=c(-3,3),main="August only")
DESeq2::plotMA(dds_sep$res, alpha = 0.05, ylim=c(-3,3),main="September only")
DESeq2::plotMA(dds_mig$res, alpha = 0.05, ylim=c(-3,3),main="Dispersers only")
DESeq2::plotMA(dds_res$res, alpha = 0.05, ylim=c(-3,3),main="Residents only")
DESeq2::plotMA(dds_sex$res, alpha = 0.05, ylim=c(-3,3),main="Sex")

Here, it is clear that there is a lot of signal for both “month” and “dispersotype”. Notice the unbalance towards upregulation in dispersers (alphabetically “d” comes before “r”, so positive) and September (negative) in the models that were run on all 11 samples at once. Part of this unbalance disappears for the models run within month, or within dispersotype. We also notice that there is virtually no signal in between dispersotypes within August, or between months within residents.

Next, we make a summary table including some basic descriptions based on Pieris rapae annotations.

SUMTAB <- read.delim("~/pieris/loci_annots.txt",sep="\t",header=F)
colnames(SUMTAB) <- c("locus","description")
SUMTAB <- SUMTAB [!duplicated(SUMTAB$locus),]

for(dds in c("dds_full_disp","dds_full_month","dds_disp","dds_month","dds_sex","dds_aug","dds_sep","dds_res","dds_mig"))
{  
  DDS <- get(dds)$res
  names <- colnames(SUMTAB)
  SUMTAB <- cbind.data.frame(SUMTAB,rep(NA,nrow(SUMTAB)))
  colnames(SUMTAB) <- c(names,dds)
  for(i in 1:nrow(SUMTAB))
  {
    a <- DDS[rownames(DDS)==SUMTAB[i,"locus"],"padj"]
    if(!identical(a,numeric(0)))
    {   
      SUMTAB[i,dds] <- a
    }
  }  
}


SUMTAB_p<-SUMTAB

Next, we will be exploring all models and outcomes separately. my_deseq_summary is a wrapper function to summarize the output of my_deseqfun. It summarizes the stats and outputs 4 figures: + The standard output graph from DESeq2, showing shrinkage of overdipsersion along the mean of normalized counts + A density plot (smooth histogram) of the transformed count values. Here we used DESeq2::varianceStabilizingTransformation(dds, blind=TRUE) + An MA-plot, where features with an adjusted p-value for differential expression smaller than 0.05 are coloured red (these plots are equivalent to the MA plots shown above). + A volcano plot.

As a cutoff for adjusted p-value we chose <0.05 here, and for fold change we did not set a cut-off. A summary of the numbers of significantly DEGs according to two adjusted p-value cutoffs can also be found (<0.05 and <0.01) as a table in the paper.

The first two models presented, are models where either “dispersotype” or “month” are tested with the other variable as a covariate. Given the partial colinearity between both variables (caused by the September dispersers being different from everybody else), we think that the model is not able to fully disambiguate between both effects. The two following models are testing for “month” and “dispersotype” effects over the entire dataset. Also here we expect confounding effects. Hence, the models are here for completeness, but we argue that the most informative ones are actually models run on split-up data (so separately per sampling moment, and separately per dispersotype)

7.4.1 Model dds_full (dispersotype)

These are genes that are DE between dispersotypes, with “month” as a covariate.

my_deseq_summary(dds_full_disp,0.05,0)
## number of tests:
## [1] 9438
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in d :[1] 443
## ( 4.69 %)
## up in r :[1] 251
## ( 2.66 %)

7.4.2 Model dds_full (month)

These are genes that are DE between both sampling moments, with “dispersotype” as a covariate.

my_deseq_summary(dds_full_month,0.05,0)
## number of tests:
## [1] 9626
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in aug :[1] 664
## ( 6.9 %)
## up in sep :[1] 646
## ( 6.71 %)

7.4.3 Model dds_disp

my_deseq_summary(dds_disp,0.05,0)
## number of tests:
## [1] 9444
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in d :[1] 492
## ( 5.21 %)
## up in r :[1] 169
## ( 1.79 %)

7.4.4 Model dds_month

my_deseq_summary(dds_month,0.05,0)
## number of tests:
## [1] 9629
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in aug :[1] 557
## ( 5.78 %)
## up in sep :[1] 653
## ( 6.78 %)

7.4.5 Model dds_sex

my_deseq_summary(dds_sex,0.05,0)
## number of tests:
## [1] 9657
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in f :[1] 22
## ( 0.23 %)
## up in m :[1] 25
## ( 0.26 %)

7.4.6 Model dds_aug

my_deseq_summary(dds_aug,0.05,0)
## number of tests:
## [1] 9998
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in d :[1] 1
## ( 0.01 %)
## up in r :[1] 5
## ( 0.05 %)

7.4.7 Model dds_sep

my_deseq_summary(dds_sep,0.05,0)
## number of tests:
## [1] 9705
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in d :[1] 1383
## ( 14.25 %)
## up in r :[1] 1112
## ( 11.46 %)

7.4.8 Model dds_res

my_deseq_summary(dds_res,0.05,0)
## number of tests:
## [1] 9495
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in aug :[1] 77
## ( 0.81 %)
## up in sep :[1] 89
## ( 0.94 %)

7.4.9 Model dds_mig

my_deseq_summary(dds_mig,0.05,0)
## number of tests:
## [1] 9598
## number of p.adj <  0.05 and abs (log2FC) > 0 :
## up in aug :[1] 724
## ( 7.54 %)
## up in sep :[1] 940
## ( 9.79 %)

7.4.10 Is there an overlap between genes that are DEG within September, and DEG within dispersers?

We ask this question because one could consider that the comparison of dispersers in August versus September on the one hand, and the comparison of residents versus dispersers within September on the other hand, actually constitute two ways of looking at the same process. In the first case, we look at the difference between individuals that manifest (perhaps superficially) similar behaviours, but with a different endpoint (dispersal vs migration). In the second case, we compare individuals that would stay (residents) and individuals that would leave.

(Numbers presented here are a fraction off from the ones from the table in the paper because a couple of mitogenes are lacking, and because you can only check genes that are not NA for both conditions.)

table(SUMTAB$dds_mig<0.05,SUMTAB$dds_sep<0.05)
##        
##         FALSE TRUE
##   FALSE  6374 1515
##   TRUE    681  973
chisq.test(table(SUMTAB$dds_mig<0.05,SUMTAB$dds_sep<0.05))$expected
##        
##            FALSE     TRUE
##   FALSE 5832.222 2056.778
##   TRUE  1222.778  431.222
chisq.test(table(SUMTAB$dds_mig<0.01,SUMTAB$dds_sep<0.01))$expected
##        
##             FALSE      TRUE
##   FALSE 7252.8995 1250.1005
##   TRUE   887.1005  152.8995